arXiv’s Statements: a preprocessing dive into arXMLiv 08.2019

Oct 1, 2019
LLamapun
arXiv
Data pre-processing
Text corpus
https://creativecommons.org/licenses/by/4.0/
We explore the labeled statements of arXiv, as marked up by the authors, and extract a dataset for supervised training.

Overview

This is part 2 in a blog series going through the practical steps of extracting a statement classification dataset from arXiv.org. The first part covered a tour of arXiv’s headings, and you can jump into the formal task description at our paper preprint.

This post: extract the annotated statement resource, and organize it for redistribution.

Tools

I am using our own homegrown llamapun toolkit for the data wrangling, allowing for the full preprocessing and extraction to take place in 3.5 hours on 32 logical threads, for our dataset of 1.37 million HTML5 documents. The performance is achieved via the jwalk and rayon parallel processing crates, and the generally low overhead of using Rust.

It took a new minor release 0.3.4 to get a number of small pieces in place:

  • added an extra control to paragraph length: between 4 and 1024 words

  • got closer to best practice on a couple of tokenization issues (related to apostrophes/possessives and formula lexemes),

  • made certain XPath selectors more robust, to ensure as exhaustive as possible statement coverage,

  • extended the selection scope from the 2018 release - very notably captions were added, which provided a lot of new volume,

  • fine-tuned heading normalization rules for extra precision, although still heuristic and prone to edge case issues,

  • regenerated the token model and GloVe embeddings we release with the data,

  • and naturally, it finalized the statement class whitelist, using a rough threshold of 10,000 paragraphs as a bare minimum for inclusion
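
The frequency cutoff in the last item can be pictured as a plain filter over per-class paragraph counts. The sketch below is only an illustration and not llamapun's code: `whitelist_by_volume` is a hypothetical helper, and only the 10,000 threshold is taken from the text above.

```rust
use std::collections::BTreeMap;

/// Rough minimum number of paragraphs a class needs for inclusion (from the post).
const MIN_PARAGRAPHS: usize = 10_000;

/// Returns the class names whose paragraph count reaches `min_count`,
/// in alphabetical order (BTreeMap iterates in key order).
/// Hypothetical helper for illustration; not part of llamapun.
fn whitelist_by_volume(counts: &BTreeMap<String, usize>, min_count: usize) -> Vec<String> {
  counts
    .iter()
    .filter(|entry| *entry.1 >= min_count)
    .map(|(name, _)| name.clone())
    .collect()
}
```

Since the post describes the threshold as rough, a real selection would presumably also involve manual review of the borderline classes.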

The final class list

In arXiv’s headings I ended on a cliffhanger: we did all the work to set up the tools, extract summaries from the data and spot deficiencies, yet we did not actually arrive at the final participants in the updated statement task. Which statements made the cut? After going through a full run to quantify the volume, and surveying some examples, we arrive at the following 46 categories:

abstract acknowledgement analysis application assumption background caption
case claim conclusion condition conjecture contribution corollary
data dataset definition demonstration description discussion example
experiment fact future work implementation introduction lemma methods
model motivation notation observation preliminaries problem proof
property proposition question related work remark result simulation
step summary theorem theory

It may appear ironic that even though we cast a wide net over all section headings, we ended up with 46 classes, four fewer than the 2018 set of 50. Yet only 26 of the original 50 classes contained more than 10,000 entries, so the comparison is quite ill-posed. In fact, even though we keep enhancing the quality control measures in data collection (e.g. by constraining the paragraph size), the 26 original high-volume classes are still available to be extracted from the 2019 data, with reliably higher volume, up by 10% or more. (Always an exception to the rule: overview is now excluded, outshone by a much more voluminous summary.)

A full table with the class frequencies will be available at the end, after we pass through the extraction steps.

The Exclusions

What did we ignore? We skip over the other 6+ million entries from our tentative report on heading volume (GitHub gist), which do not meet our frequency requirement. Also ignored are thousands of low-frequency \newtheorem declarations via the amsthm LaTeX package, for the same reason of low volume. Both data streams remain readily available in the arXMLiv 08.2019 release and can be extracted if/when needed for experiments that focus on breadth, rather than volume. I also excluded a few classes that usually do not contain narrative statements, but cover metadata or structured content, such as: references, appendix, algorithm and keywords.

Importantly, over a dozen low-volume/out-of-scope entries that were part of the officially released 2018 task definition are no longer extracted. They are:

affirmation answer bound comment condition constraint convention criterion
exercise expansion expectation explanation hint issue keywords note
notice principle rule solution overview

All excluded classes may return in future versions of the data, or in other task formulations, as they are certainly valid aspects of scientific discourse.

Extracting the dataset

We’ll walk through the highlights of the statement extraction example in llamapun.

The example takes as input a path to a corpus directory, containing HTML files generated by latexml, and a target filename for a Tar archive.

  $ cargo run --release --example corpus_statement_paragraphs_model \
      /data/datasets/dataset-arXMLiv-08-2019/ /var/local/statement_paragraphs_arxmliv_08_2019.tar

We use a single builder to write the archive, held in a thread-friendly mutex, so that threads can keep adding data to the same target archive without any race conditions.

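  use std::collections::HashSet;
  use std::fs::File;
  use std::sync::{Arc, Mutex};
  use tar::{Builder, Header};
  // Shared archive writer, plus the set of entry names already written
  struct TarBuilder {
    builder: Builder<File>,
    names: HashSet<String>,
  }
  let tar_builder = Arc::new(Mutex::new(TarBuilder { // ...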

Our builder also bookkeeps a HashSet with previously seen paragraph shas, to ensure each entry added to the resource is distinct and there is no overlap between classes.
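
To make the deduplication idea concrete, here is a minimal, dependency-free sketch. It is not the llamapun implementation: `DedupStore` and `save_concurrently` are hypothetical names, and an in-memory `Vec` stands in for the tar archive. It only shows how a `HashSet` of seen names behind an `Arc<Mutex<...>>` lets several threads write to one target without duplicates.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// In-memory stand-in for the tar-backed builder (hypothetical, for illustration only).
struct DedupStore {
  names: HashSet<String>,
  saved: Vec<(String, String)>,
}

impl DedupStore {
  fn new() -> Self {
    DedupStore { names: HashSet::new(), saved: Vec::new() }
  }

  /// Records `content` under `name`, unless that name was already seen.
  /// Returns true when the entry was newly added.
  fn save(&mut self, content: &str, name: &str) -> bool {
    // `HashSet::insert` returns false if the value was already present.
    if self.names.insert(name.to_string()) {
      self.saved.push((name.to_string(), content.to_string()));
      true
    } else {
      false
    }
  }
}

/// Spawns `workers` threads that all try to save the same entries,
/// and returns how many distinct entries ended up in the store.
fn save_concurrently(workers: usize, entries: Vec<(String, String)>) -> usize {
  let store = Arc::new(Mutex::new(DedupStore::new()));
  let entries = Arc::new(entries);
  let handles: Vec<_> = (0..workers)
    .map(|_| {
      let store = Arc::clone(&store);
      let entries = Arc::clone(&entries);
      thread::spawn(move || {
        for (name, content) in entries.iter() {
          // Hold the lock only for the duration of a single save.
          store.lock().unwrap().save(content, name);
        }
      })
    })
    .collect();
  for handle in handles {
    handle.join().unwrap();
  }
  let count = store.lock().unwrap().saved.len();
  count
}
```

Because the seen-name check and the write happen under the same lock, two threads can never both conclude that a name is new.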

The main work is done in the parallel traversal via jwalk, yielding each document locally to a thread and executing the statement extraction code. It is implemented as part of llamapun’s parallel_data::Corpus.

  let mut corpus = Corpus::new(corpus_path);
  let catalog = corpus.catalog_with_parallel_walk(|doc| {
    extract_document_statements(doc, tar_builder.clone(), discard_math_flag)
  });

On to the extraction logic: each document is inspected as follows:

  // [skip] some document-level context variables and checks
  // `extended_paragraph_iter` covers narrative paragraphs, abstracts, captions,
  'paragraphs: for mut paragraph in document.extended_paragraph_iter() {
    let para = paragraph.dnm.root_node; // the underlying XML node
    // ...[skip]... setup for prev_heading_opt which contains Some(heading_node)
    // when the prior sibling of the paragraph is a heading title
    // we ignore all other paragraphs, except for the specially marked up cases of acknowledgement and caption, e.g.
    let special_marker = if para_class.contains("ltx_acknowledgement") {
      Some(StructuralEnv::Acknowledgement)
    } else if para_class.contains("ltx_caption") {
      Some(StructuralEnv::Caption)
    } else {
      None
    };
    // Before we go into tokenization, ensure this is an English paragraph
    if data_helpers::invalid_for_english_latin(&paragraph.dnm) {
      continue 'paragraphs;
    }

So far we have checked that a paragraph has special markup or is preceded by a heading, as well as it being identified as English, skipping over all others. Next, we can extract the precise label, and check it is in our whitelist of 46 classes.

    // I. Determine the class for this paragraph entry, so that we can iterate over its content after
    // if no markup at all, ignore the paragraph, as we don’t have reliable classification information
    let class_directory = if let Some(env) = special_marker {
      // case 1: special markup for caption and acknowledgement
      env.to_string()
    } else {
      // case 2: AMS markup + accepted AMS class
      let ams_class = if has_ams_markup {
        let parent_class = para_parent.get_attribute("class").unwrap_or_default();
        ams::class_to_env(&parent_class)
      } else {
        None
      };
      if let Some(env) = ams_class {
        match env {
          // Other and other-like entities that are too noisy to include
          // New for 2019: ignore the low-volume cases as well
          AmsEnv::Affirmation
          | AmsEnv::Algorithm
          // |.. [skip] 19 other variants
          | AmsEnv::Other => continue 'paragraphs,
          whitelisted => whitelisted.to_string(),
        }
      } else if let Some(heading_node) = prev_heading_opt {
        // case 3: structural heading markup
        if let Some(heading_text) = data_helpers::heading_from_node_aux(
          heading_node,
          &document.corpus.tokenizer,
          &mut context,
        ) {
          let env: StructuralEnv = heading_text.as_str().into();
          if env == StructuralEnv::Other {
            // any of the other 6+ million headings that are not whitelisted, ignore
            continue 'paragraphs;
          }
          // otherwise, any of the `StructuralEnv` enum variants are accepted classes
          env.to_string()
        }
    //... skip other cases

There is a lot going on behind the scenes in this snippet. The ams::class_to_env function performs a rather ambitious lookup, mapping LaTeX-defined environment names, cleanly and reliably preserved in the HTML attributes, down to their canonical statement classes. The work behind that mapping was part of a survey that went over 20,000 of the author-provided AMS classes, retaining the ones with clear and robust intent.

The data_helpers::heading_from_node_aux function hides quite a significant amount of logic as well. It uses llamapun’s “document narrative map” abstractions (DNM) to obtain a robust plain text version of the heading element, discarding e.g. tag markup for section numbers and cleanly stripping away styling information. It then performs normalization on the plain text, reducing a shortlist of known compound headings such as “Proof of theorem ref” down to proof. Finally, this normalized heading string is mapped into a StructuralEnv enum, checking it against the whitelist we experimentally defined, and recasting anything outside it as an “Other” label.
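
As a rough illustration of the normalize-then-map step, the sketch below reduces a raw heading to a canonical string and converts it into an enum with a catch-all variant. Everything here is an assumption made for the example: `HeadingClass`, `normalize_heading`, the three sample classes and the single "proof of ..." rule are hypothetical, and llamapun's actual rules are considerably more involved.

```rust
/// A tiny subset of statement classes plus a catch-all (hypothetical; the real enum is larger).
#[derive(Debug, PartialEq, Clone, Copy)]
enum HeadingClass {
  Introduction,
  Proof,
  RelatedWork,
  Other,
}

/// Lowercases ASCII letters, replaces digits and punctuation with spaces,
/// and collapses runs of whitespace.
fn normalize_heading(raw: &str) -> String {
  let cleaned: String = raw
    .chars()
    .map(|c| if c.is_alphabetic() { c.to_ascii_lowercase() } else { ' ' })
    .collect();
  let words: Vec<&str> = cleaned.split_whitespace().collect();
  // Reduce compound headings such as "proof of theorem ref" to their leading keyword.
  if words.first() == Some(&"proof") {
    return "proof".to_string();
  }
  words.join(" ")
}

impl From<&str> for HeadingClass {
  fn from(heading: &str) -> Self {
    match normalize_heading(heading).as_str() {
      "introduction" => HeadingClass::Introduction,
      "proof" => HeadingClass::Proof,
      "related work" | "related works" => HeadingClass::RelatedWork,
      // Anything outside the whitelist is recast as Other.
      _ => HeadingClass::Other,
    }
  }
}
```

With a `From<&str>` implementation in place, a call site can write `let class: HeadingClass = heading_text.as_str().into();`, mirroring the conversion seen in the extraction code above.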

At this point, we have skipped all paragraphs without a whitelisted statement class. We have retained special markup, whitelisted AMS markup, and whitelisted structural heading markup. Thus, knowing this is a paragraph to retain, we need to normalize it to a plain text form, derived from its HTML node:

  // II. We have a labeled statement. Extract content of current paragraph, validating basic data quality
  let mut word_count = 0;
  let mut invalid_paragraph = false;
  let mut paragraph_buffer = String::new();
  'words: for word in paragraph.word_and_punct_iter() {
    let word_string = match data_helpers::ams_normalize_word_range(//...
    {
      Ok(w) => w,
      Err(_) => {
        invalid_count += 1;
        invalid_paragraph = true;
        break 'words;
      }
    };
    if !word_string.is_empty() {
      word_count += 1;
      paragraph_buffer.push_str(&word_string);
      paragraph_buffer.push(' ');
    }
  }
  // Discard paragraphs outside of a reasonable [4,1024] word count range
  if word_count < 4 || word_count > 1024 {
    invalid_count += 1;
    invalid_paragraph = true;
  }

  // If paragraph was valid and contains text, record it
  if !invalid_paragraph {
    paragraph_buffer.push('\n');
    paragraph_count += 1;
    // precompute sha inside the thread, to do more in parallel
    let paragraph_filename = hash_file_path(&class_directory, &paragraph_buffer);
    thread_data.push((paragraph_buffer, paragraph_filename));
  }

This is a bit more direct. We iterate over a paragraph’s words and punctuation, and collect words for our specific use case. The ams_normalize_word_range helper accepts a set of configuration options that control whether to e.g. keep or discard math, punctuation, and letter case. It also internally handles substituting the MathML representation of formulas with their sub-formula lexemes (a special feature of this dataset, provided via latexml’s tokenization of math expressions). Valid paragraphs are collected with their in-archive name prepared, for follow-up serialization to disk.

Lastly, having collected all appropriate paragraphs for this document, we can lock the tar builder and write the data to disk and deallocate it, keeping the RAM footprint of the traversal contained.

  // III. Record valid entries into archive target, having collected all labeled samples for this document
  let mut builder_lock = tar_builder.lock().unwrap();
  for (paragraph_buffer, paragraph_filename) in thread_data.into_iter() {
    builder_lock
      .save(&paragraph_buffer, &paragraph_filename)
      .expect("Tar builder should always succeed.")
  }
  // IV. Bookkeep counts for final report and finish this document
  thread_counts.insert(String::from("paragraph_count"), paragraph_count);
  thread_counts.insert(String::from("invalid_count"), invalid_count);
  thread_counts
}

Each statement entry is named after the SHA-256 of its contents, and is added to one of 46 subdirectories named after the statement class.
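
The naming scheme can be sketched as a content-addressed path. Note the assumptions: to keep the example free of external crates, it uses the standard library's `DefaultHasher`, a 64-bit non-cryptographic hash, as a stand-in for SHA-256. The `content_addressed_path` name and the `.txt` extension are also hypothetical, so the paths it produces will not match those in the actual archive.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Builds an archive path of the form "<class>/<hex digest>.txt".
/// NOTE: `DefaultHasher` is a non-cryptographic stand-in used to keep the sketch
/// dependency-free; the dataset itself names entries after a SHA-256 digest.
fn content_addressed_path(class_directory: &str, content: &str) -> String {
  let mut hasher = DefaultHasher::new();
  content.hash(&mut hasher);
  // Zero-pad the 64-bit digest to 16 hex characters for fixed-width names.
  format!("{}/{:016x}.txt", class_directory, hasher.finish())
}
```

Deriving the name from the content alone means identical paragraphs map to the same digest, which is what makes a set of seen names sufficient for deduplication.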

Three and a half hours later, a 40 GB tar file containing 22.1 million statement paragraphs is ready for experimentation!

The arXMLiv Statement Classification Dataset, 2019

While the numbers and scope in this post are still tentative, I can report a very promising look into the new extraction run over the 1.37 million arXiv articles, up to 08.2019.

With 22.1 million paragraphs collected, from a total of 97.6 million, as defined by our “extended paragraph” iterator, we are retaining roughly 22% of the total paragraph volume available in the dataset. As a very loose estimate, given that our embeddings statistics show ∼15.2 billion tokens across all paragraphs, this statement set contains ∼3 billion tokens.
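
The back-of-the-envelope estimate can be written out as a proportional scaling. This is only a sketch under the stated loose assumption that retained paragraphs have the same average length as all paragraphs; `estimate_retained_tokens` is a hypothetical helper name.

```rust
/// Scales a corpus-wide token count by the share of paragraphs retained.
/// Assumes retained paragraphs have the same average length as all paragraphs.
fn estimate_retained_tokens(
  retained_paragraphs: f64,
  total_paragraphs: f64,
  total_tokens: f64,
) -> f64 {
  (retained_paragraphs / total_paragraphs) * total_tokens
}
```

Plugging in the rounded figures above (22.1 million of 97.6 million paragraphs, 15.2 billion tokens) gives a share of about 0.226 and roughly 3.4 billion tokens, in the same ballpark as the loose estimate.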

Thus, for our 46 classes of choice, the distinct paragraphs extracted are ranked as follows:

Class Entries
caption 7,098,238
proof 2,719,458
lemma 1,513,073
theorem 1,510,103
abstract 1,167,923
introduction 1,056,110
proposition 940,306
definition 844,670
remark 797,994
acknowledgement 680,991
conclusion 511,117
corollary 493,600
example 390,229
model 343,543
result 299,991
discussion 192,629
summary 139,725
problem 126,985
experiment 120,689
analysis 120,661
methods 119,913
claim 94,910
observation 70,621
notation 69,567
preliminaries 68,695
property 65,284
conjecture 64,350
simulation 59,396
related work 54,910
condition 46,124
assumption 40,409
question 39,777
background 34,819
contribution 29,205
description 25,337
demonstration 24,984
fact 20,846
motivation 16,887
case 15,058
step 14,255
application 13,212
future work 12,263
implementation 10,849
data 10,589
dataset 9,738
theory 7,184
Table 1: Tentative Statement Classification Dataset, 2019 edition

For convenience, here is the same table in alphabetical order:

Class Entries
abstract 1,167,923
acknowledgement 680,991
analysis 120,661
application 13,212
assumption 40,409
background 34,819
caption 7,098,238
case 15,058
claim 94,910
conclusion 511,117
condition 46,124
conjecture 64,350
contribution 29,205
corollary 493,600
data 10,589
dataset 9,738
definition 844,670
demonstration 24,984
description 25,337
discussion 192,629
example 390,229
experiment 120,689
fact 20,846
future work 12,263
implementation 10,849
introduction 1,056,110
lemma 1,513,073
methods 119,913
model 343,543
motivation 16,887
notation 69,567
observation 70,621
preliminaries 68,695
problem 126,985
proof 2,719,458
property 65,284
proposition 940,306
question 39,777
related work 54,910
remark 797,994
result 299,991
simulation 59,396
step 14,255
summary 139,725
theorem 1,510,103
theory 7,184
Table 2: Tentative Statement Classification Dataset, 2019 edition (alphabetical)

Note on Reuse

At its worst, the final tar file contains 7 million files in a single subdirectory. Unpacking it as-is can outright lead to errors with your local filesystem, or to extreme slowness in operations that were not written with large directories in mind. So instead, the tar is best used by walking it directly and re-mapping the data into another resource. Commonly, I would use the word embeddings to map each token string to its embedding id, map each class to its label id in the same way, and transfer that now model-specific data to an HDF5 file, ready to be used in a Jupyter notebook workflow.
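
The re-mapping step might look roughly like the sketch below, which turns a space-separated paragraph into embedding ids and a class name into a label id. All names here (`build_vocab`, `encode_paragraph`, `label_id`, `UNKNOWN_ID`) are hypothetical, the id scheme is an assumption, and reading the tar or writing HDF5 is left out to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Id reserved for tokens missing from the vocabulary (an assumption of this sketch).
const UNKNOWN_ID: u32 = 0;

/// Assigns ids 1, 2, 3, ... to vocabulary words in the order given.
fn build_vocab(words: &[&str]) -> HashMap<String, u32> {
  words
    .iter()
    .enumerate()
    .map(|(index, word)| (word.to_string(), index as u32 + 1))
    .collect()
}

/// Maps a space-separated paragraph to embedding ids, using `UNKNOWN_ID` for unseen tokens.
fn encode_paragraph(paragraph: &str, vocab: &HashMap<String, u32>) -> Vec<u32> {
  paragraph
    .split_whitespace()
    .map(|token| vocab.get(token).copied().unwrap_or(UNKNOWN_ID))
    .collect()
}

/// Maps a class directory name to its position in a fixed class list, if present.
fn label_id(class_name: &str, classes: &[&str]) -> Option<usize> {
  classes.iter().position(|class| *class == class_name)
}
```

In practice the vocabulary would be loaded from the released embeddings file, in its row order, so that ids line up with the embedding matrix.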

I haven’t made this new statement set public yet, but certainly intend to do so shortly, after another couple of integrity checks. Until then I warmly recommend getting started with the 2018 statement classification set if modeling scientific discourse piques your interest!